The Introduction to Open Data Science course (IODS) gives us students an opportunity to get familiar with the R programming language and some open software tools (RStudio, R Markdown and GitHub). Later during the course, we will learn how these tools can be used in statistical data analysis.
The R programming language and the tools we are using during this course are all new to me. I have a little background in statistics, but I took those courses a long time ago, so I don’t remember much about the topic either. In addition, I am a new Mac user with very limited computer skills, so during this first week I had some extra work solving beginner’s issues such as how to download and install programs. Luckily there were plenty of instructions available on the IODS course page, including videos, so everything should now be installed and working, at least I hope so.
The first week’s first assignment was to learn the very basics of the R programming language on the DataCamp platform by passing the introductory course R Short and Sweet. The DataCamp course was actually straightforward and easy to pass if one read the instructions carefully. Passing it took me quite a long time (because everything was so new to me, I had to read the instructions more than once), but I had such fun working through this short course that I felt a little sad when it ended after two chapters.
The other assignment for the week was to learn to use RStudio and GitHub by writing and modifying the output of this diary page. To do that, I had to download my “forked” version of the course repository and open its files with my local GitHub client. After that I was able to modify the files in RStudio and commit the changes through GitHub.
In this week’s R exercise we perform regression analysis on questionnaire data about students’ learning approaches (deep and surface learning) and attitudes toward the learning process and the study subject. The answers are scaled and combined with final exam points, overall attitude toward statistics, gender and age.
Before the analysis, we did a data wrangling step, where the data table is converted into a different type of file for analysis. In this step the data was also filtered: observations with 0 exam points were left out of the data table, the deep, strategic and surface learning question scores were averaged, and the attitude score was divided by the number of attitude questions.
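As a sketch (not the actual wrangling script), the filtering and averaging steps could look roughly like this on a tiny made-up data frame; the column names D03, D06 and attitude_sum are illustrative only:

```r
# Toy data frame: two deep-learning questions (scale 1-5), a summed
# attitude score over 10 questions, and exam points
df <- data.frame(
  D03 = c(4, 5, 3),
  D06 = c(2, 5, 3),
  attitude_sum = c(30, 40, 20),
  points = c(25, 0, 22)
)
# Average the deep-learning question scores into one variable
df$deep <- rowMeans(df[, c("D03", "D06")])
# Divide the summed attitude score by the number of attitude questions
df$attitude <- df$attitude_sum / 10
# Exclude observations with 0 exam points
df <- subset(df, points > 0)
```

After the last step only the two rows with nonzero exam points remain.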
Tables 1-3. Summary of the data and its structure. The data table has seven variables. The first variable is gender, which can be either female or male. The next one is age, an integer value, as are the exam points. The other scores in the table are values averaged over the multiple questions on that specific topic.
# Reading the data into R (students2014) from the local data wrangling file (learning2014.txt). Setting header=TRUE because the file contains the names of the variables on its first line.
students2014<-read.table("~/Tiedostot_J/IODS-project/data/learning2014.txt", header=TRUE)
#Analysis of the data (structure and summary)
str(students2014)
## 'data.frame': 166 obs. of 7 variables:
## $ gender : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
## $ age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: int 37 31 25 35 37 38 35 29 38 21 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ points : int 25 12 24 10 22 21 21 31 24 26 ...
head(students2014)
## gender age attitude deep stra surf points
## 1 F 53 37 3.583333 3.375 2.583333 25
## 2 M 55 31 2.916667 2.750 3.166667 12
## 3 F 49 25 3.500000 3.625 2.250000 24
## 4 M 53 35 3.500000 3.125 2.250000 10
## 5 M 49 37 3.666667 3.625 2.833333 22
## 6 F 38 38 4.750000 3.625 2.416667 21
summary(students2014)
## gender age attitude deep stra
## F:110 Min. :17.00 Min. :14.00 Min. :1.583 Min. :1.250
## M: 56 1st Qu.:21.00 1st Qu.:26.00 1st Qu.:3.333 1st Qu.:2.625
## Median :22.00 Median :32.00 Median :3.667 Median :3.188
## Mean :25.51 Mean :31.43 Mean :3.680 Mean :3.121
## 3rd Qu.:27.00 3rd Qu.:37.00 3rd Qu.:4.083 3rd Qu.:3.625
## Max. :55.00 Max. :50.00 Max. :4.917 Max. :5.000
## surf points
## Min. :1.583 Min. : 7.00
## 1st Qu.:2.417 1st Qu.:19.00
## Median :2.833 Median :23.00
## Mean :2.787 Mean :22.72
## 3rd Qu.:3.167 3rd Qu.:27.75
## Max. :4.333 Max. :33.00
Plot 1. To summarize, we have plotted attitude vs. exam points. The points are coloured by gender, and a regression line is drawn for each group to visualize whether there is a gender difference.
#Accessing libraries ggplot2 and GGally
library(ggplot2)
library(GGally)
# Plot with ggplot (aes)
p1 <- ggplot(students2014, aes(x = attitude, y = points, col = gender))
# point visualization
p2 <- p1 + geom_point()
# adding a regression line
p3 <- p2 + geom_smooth(method = "lm")
# adding a title and drawing the plot
p4 <- p3 + ggtitle("Student's attitude vs exam points")
# printing the plot
p4
Plot 2. Summaries of the variables in the data. This plot makes it easy to see the different relationships between the variables.
# Drawing plot matrix with ggpairs()
p <- ggpairs(students2014, mapping = aes(col=gender, alpha=2), lower = list(combo = wrap("facethist", bins = 20)))
# draw the plot
p
Table 4. This table shows the actual linear regression model with multiple explanatory variables. The exam points are the target of the analysis, and the output gives us the model as well as details about the residuals. From the coefficients we can see where the fitted line crosses the y-axis (the intercept for points), and for each variable the model also estimates whether its effect differs significantly from zero, in other words whether that variable makes a difference to the points (t and p values). Of these three variables, attitude seems to have the highest impact on exam points.
The multiple R squared of the model tells us how well the data fits the multiple linear regression model: it measures the share of the variation in exam points that the model explains. The more the observations deviate from the model, the lower the R squared value, suggesting that some other model would be needed to explain the data. In our case there seems to be quite a lot of unexplained variability, meaning that the dependence of exam points on attitude, age and strategic learning is not that strong.
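As a quick illustration of how R squared relates to the residuals, it can be computed by hand from the residual and total sums of squares of any fitted model (a toy fit with made-up x and y values below):

```r
# Toy linear fit to illustrate R squared = 1 - SS_res / SS_tot
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 8.1, 9.8)
fit <- lm(y ~ x)
ss_res <- sum(residuals(fit)^2)  # variation left unexplained by the model
ss_tot <- sum((y - mean(y))^2)   # total variation around the mean
r2 <- 1 - ss_res / ss_tot        # matches summary(fit)$r.squared
```

The larger the residuals are relative to the total variation, the smaller this value gets.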
# creating a regression model with multiple explanatory variables
my_model <- lm(points ~ attitude + age + stra, data = students2014)
# print out a summary of the model
summary(my_model)
##
## Call:
## lm(formula = points ~ attitude + age + stra, data = students2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.1149 -3.2003 0.3303 3.4129 10.7599
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.89543 2.64834 4.114 6.17e-05 ***
## attitude 0.34808 0.05622 6.191 4.72e-09 ***
## age -0.08822 0.05302 -1.664 0.0981 .
## stra 1.00371 0.53434 1.878 0.0621 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.26 on 162 degrees of freedom
## Multiple R-squared: 0.2182, Adjusted R-squared: 0.2037
## F-statistic: 15.07 on 3 and 162 DF, p-value: 1.07e-08
Plot 3. The first diagnostic plot (residuals vs. fitted values) again shows how well the residuals correspond to the model’s fitted values; the points should be scattered without showing any kind of trend. The second plot is a normal Q-Q plot, which checks the normality of the errors; again, the points should follow the trend line without drifting off in some direction. The third plot shows residuals vs. leverage and takes into account that some observations may have a larger effect on the model fit than others; for example, extreme values may affect the fitted regression line more than smaller ones.
# drawing diagnostic plots with plot(), choosing plots 1, 2 and 5. par() makes room for multiple plots at the same time (room for four plots, only 3 used here)
par(mfrow=c(2,2))
plot(my_model, which=c(1:2,5))
The data for this week’s analysis is from the UCI Machine Learning Repository: link to the data source.
The data has 33 different variables (questions). Many variables are binary, such as school, which each student answers as either GP (Gabriel Pereira) or MS (Mousinho da Silveira). Family size is also binary: LE3 means less than or equal to 3, GT3 greater than 3. Some binary variables are simply answered yes or no. Other variables are answered with a number, such as age, which is an integer from 15 to 22. Some questions are answered nominally, such as mother’s job. At the end there are G1, G2 and G3, the numeric grades from periods 1 and 2 and the final grade in the student’s course subject, Math or Portuguese. There are many background questions and questions related to studying, but also questions about alcohol consumption on workdays and weekends.
The largest difference between the original data and the data table presented here (Table 1) is the two extra columns added after the grade variables, concerning alcohol use and high use of alcohol. The alcohol use column combines the two original questions Dalc (workday alcohol consumption) and Walc (weekend alcohol consumption), both rated from 1 (low use) to 5 (high use), by averaging them. The high use column is TRUE for students whose average alcohol use is greater than 2. So our data table has 35 variables instead of the original 33. The data has also been modified in the data wrangling: the two questionnaires have been merged into one by combining the data files and excluding possible duplicate answers where a student has taken both questionnaires.
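A minimal sketch of how the two new columns can be derived (shown on a toy data frame, not the actual wrangling script):

```r
# Toy rows with workday (Dalc) and weekend (Walc) consumption, scale 1-5
demo <- data.frame(Dalc = c(1, 3, 2), Walc = c(1, 5, 2))
# alc_use is the average of the two consumption answers
demo$alc_use <- (demo$Dalc + demo$Walc) / 2
# high_use is TRUE when the average is greater than 2
demo$high_use <- demo$alc_use > 2
demo
```

Here only the middle row (average 4) ends up in the high-use group.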
Table 1. Variables in the combined data file.
alc<-read.table("~/Tiedostot_J/IODS-project/data/alc.txt", sep="", header=TRUE)
colnames(alc)
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "nursery" "internet" "guardian" "traveltime"
## [16] "studytime" "failures" "schoolsup" "famsup" "paid"
## [21] "activities" "higher" "romantic" "famrel" "freetime"
## [26] "goout" "Dalc" "Walc" "health" "absences"
## [31] "G1" "G2" "G3" "alc_use" "high_use"
The purpose of our analysis is to study relationships between some variables in the data set and alcohol consumption. For this purpose, I have chosen four variables that could be affected by how much a student uses alcohol. Two of them should take higher values when alcohol consumption is low, and the values of the other two should increase when the student consumes more alcohol. Low alcohol consumption could possibly go together with better family relationships and longer weekly study time, while high alcohol consumption could increase the number of absences from school and the time spent going out with friends. So, I have chosen the following four variables for further analysis: famrel (quality of family relationships), studytime (weekly study time), goout (going out with friends) and absences (number of school absences).
These variables will be analyzed against high alcohol use. In the first graph, the family relationship score should be lower for students using high amounts of alcohol (TRUE in the graph) than for those who are not (FALSE); in the second, study time should likewise be lower for the high-use group. In the last two figures it should be the other way around: the box for high alcohol consumers should be higher than for low consumers.
Data table 1 shows the number of students who consume higher amounts of alcohol (high alcohol consumption = TRUE) and of those who consume less (high alcohol consumption = FALSE). Data table 2 shows the mean family relationship scores in the high-consumption group (TRUE) and the low-consumption group (FALSE).
Boxplot 1. The effect of a student’s alcohol consumption on family relationships.
#access to libraries dplyr, tidyr and ggplot2. Had to suppress the dplyr startup message.
suppressPackageStartupMessages({
library("dplyr")
})
library(tidyr)
library(ggplot2)
# produce summary statistics by group. How many students are in the high alcohol consumption group?
alc %>% group_by(high_use) %>% summarise(count = n())
## # A tibble: 2 × 2
## high_use count
## <lgl> <int>
## 1 FALSE 268
## 2 TRUE 114
# What is the mean family relationship score in the high/low consumption groups?
alc %>% group_by(high_use) %>% summarise(mean_grade=mean(famrel))
## # A tibble: 2 × 2
## high_use mean_grade
## <lgl> <dbl>
## 1 FALSE 4.003731
## 2 TRUE 3.780702
# initialize a box plot of high_use and famrel (the same is done below for studytime, goout and absences)
g1<-ggplot(alc, aes(x=high_use, y=famrel))
# define the plot as a boxplot and draw it
g1 + geom_boxplot() + ylab("Quality of family relationships")+xlab("Consumes high amounts of alcohol")+ggtitle("Effect of student's alcohol consumption on his/her family relationships")
So it seems that alcohol consumption possibly has a very small effect on the quality of family relationships: those who did not use high amounts of alcohol rated their family relationships at 4.0 on average, while for the high-use group the average was 3.8.
Data table 3 shows the mean time used for studying in the high alcohol consumption group (TRUE) and the low consumption group (FALSE).
Boxplot 2. The effect of a student’s alcohol consumption on time used for studying.
#access to libraries tidyr, dplyr and ggplot2.
library(tidyr); library(dplyr); library(ggplot2)
# What is the mean time used for studies in high/low consumer's group?
alc %>% group_by(high_use) %>% summarise(mean_grade=mean(studytime))
## # A tibble: 2 × 2
## high_use mean_grade
## <lgl> <dbl>
## 1 FALSE 2.149254
## 2 TRUE 1.771930
g2<-ggplot(alc, aes(x=high_use, y=studytime))
g2 + geom_boxplot() + ylab("Weekly study time")+xlab("Consumes high amounts of alcohol")+ggtitle("Effect of student's alcohol consumption on his/her study time")
The average study time score (on a scale of 1-4) in the low-consumption group was 2.1, compared with 1.8 in the high-consumption group. Again the effect was small, as suspected.
Data table 4 shows the mean time spent out with friends in the high alcohol consumption group (TRUE) and the low consumption group (FALSE).
Boxplot 3. The effect of a student’s alcohol consumption on time spent out with friends.
#access to libraries tidyr, dplyr and ggplot2.
library(tidyr); library(dplyr); library(ggplot2)
# What is the mean "going out with friends" score in the high/low consumption groups?
alc %>% group_by(high_use) %>% summarise(mean_grade=mean(goout))
## # A tibble: 2 × 2
## high_use mean_grade
## <lgl> <dbl>
## 1 FALSE 2.854478
## 2 TRUE 3.719298
g3<-ggplot(alc, aes(x=high_use, y=goout))
g3 + geom_boxplot() + ylab("Go out with friends")+xlab("Consumes high amounts of alcohol")+ggtitle("Effect of student's alcohol consumption on going out with friends")
Again the result follows the expected logic: those who consume more alcohol tend to spend more time out with their friends. So this time the TRUE group scored higher than the FALSE group (3.7 vs 2.9).
Data table 5 shows the mean number of absences from class for the same groups as previously (high alcohol consumption TRUE or FALSE).
Boxplot 4. The effect of a student’s alcohol consumption on the number of absences from school.
#access to libraries tidyr, dplyr and ggplot2.
library(tidyr); library(dplyr); library(ggplot2)
# What is the mean number of absences in the high/low consumption groups?
alc %>% group_by(high_use) %>% summarise(mean_grade=mean(absences))
## # A tibble: 2 × 2
## high_use mean_grade
## <lgl> <dbl>
## 1 FALSE 3.705224
## 2 TRUE 6.368421
g4<-ggplot(alc, aes(x=high_use, y=absences))
g4 + geom_boxplot() + ylab("Number of school absences")+xlab("Consumes high amounts of alcohol")+ggtitle("Effect of student's alcohol consumption on school absences")
So it seems that alcohol consumption has the highest impact on absences from school. The high-use group had been absent more than 6 times on average, whereas the other group was absent fewer than 4 times on average.
Then the relationships between the chosen variables and the binary high/low alcohol consumption are explored statistically with logistic regression.
Table 6. The summary of the fitted model, and Table 7, the coefficients of the model.
m<-glm(high_use~famrel+studytime+goout+absences, data=alc, family="binomial")
summary(m)
##
## Call:
## glm(formula = high_use ~ famrel + studytime + goout + absences,
## family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8701 -0.7738 -0.5019 0.8042 2.5416
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.28606 0.70957 -1.812 0.06992 .
## famrel -0.33699 0.13681 -2.463 0.01377 *
## studytime -0.55089 0.16789 -3.281 0.00103 **
## goout 0.75953 0.12041 6.308 2.82e-10 ***
## absences 0.06753 0.02175 3.104 0.00191 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 384.07 on 377 degrees of freedom
## AIC: 394.07
##
## Number of Fisher Scoring iterations: 4
coef(m)
## (Intercept) famrel studytime goout absences
## -1.28606058 -0.33699130 -0.55089391 0.75953025 0.06753071
The logistic regression coefficients also back up the previous analysis: the coefficients are positive for the last two variables (goout and absences) but negative for the first two (famrel and studytime).
Next the odds ratios and confidence intervals for the model are computed (Table 8).
#compute odds ratios and confidence intervals
OR<-coef(m)%>%exp
CI<-confint(m)%>%exp
## Waiting for profiling to be done...
#Print these out together
cbind(OR,CI)
## OR 2.5 % 97.5 %
## (Intercept) 0.2763573 0.06723596 1.0961732
## famrel 0.7139151 0.54460646 0.9331198
## studytime 0.5764343 0.41040872 0.7941804
## goout 2.1372720 1.69853389 2.7261579
## absences 1.0698631 1.02591583 1.1187950
Finally, we test the predictive power of the model with a 2x2 cross tabulation of the actual values versus the predictions.
#predict the probability of high_use
probabilities<-predict(m, type="response")
#add the predicted probabilities to alc
alc<-mutate(alc, probability=probabilities)
#use the probabilities to make a prediction of high use
alc<-mutate(alc, prediction=probability>0.5)
#Tabulate the target variable vs the predictions
table(high_use=alc$high_use, prediction=alc$prediction)
## prediction
## high_use FALSE TRUE
## FALSE 242 26
## TRUE 65 49
#Initialize a plot of high use vs probability in alc
g<-ggplot(alc, aes(x=probability, y=high_use, col=prediction))
#Define the geom as point and draw the plot
g+geom_point()
#Tabulate the target variable vs the predictions
table(high_use=alc$high_use, prediction=alc$prediction)%>% prop.table()%>%addmargins()
## prediction
## high_use FALSE TRUE Sum
## FALSE 0.63350785 0.06806283 0.70157068
## TRUE 0.17015707 0.12827225 0.29842932
## Sum 0.80366492 0.19633508 1.00000000
Lastly, the total proportion of inaccurately classified individuals (= the training error) is presented below.
#define mean prediction error
loss_func<-function(class, prob){
n_wrong<-abs(class-prob)>0.5
mean(n_wrong)
}
#Call loss_func to compute the proportion of wrong predictions
loss_func(class=alc$high_use, prob=alc$probability)
## [1] 0.2382199
In this week’s RStudio exercise we use the Boston data set. It has 506 observations and 14 variables, so the data set has 506 rows and 14 columns.
The columns in the Boston data set are (as described in the MASS package documentation):
crim - per capita crime rate by town
zn - proportion of residential land zoned for lots over 25,000 sq.ft.
indus - proportion of non-retail business acres per town
chas - Charles River dummy variable (1 if the tract bounds the river, 0 otherwise)
nox - nitrogen oxides concentration (parts per 10 million)
rm - average number of rooms per dwelling
age - proportion of owner-occupied units built prior to 1940
dis - weighted mean of distances to five Boston employment centres
rad - index of accessibility to radial highways
tax - full-value property-tax rate per $10,000
ptratio - pupil-teacher ratio by town
black - 1000(Bk - 0.63)^2, where Bk is the proportion of blacks by town
lstat - lower status of the population (percent)
medv - median value of owner-occupied homes in $1000s
# Load the Boston data from MASS package
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
data("Boston")
# Explore the structure and dimension of the data
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506 14
The structure view above shows that the Boston data contains only numerical values.
In the plot below, the linear correlations between variables are roughly examined with a scatterplot matrix. On the diagonal from top left to bottom right are the Boston data variables, and each variable is plotted against every other.
The next data table shows summaries of the variables. Some of the variables have quite a large spread, such as crim and zn, while others vary much less, ptratio for example. The values range from 0 to 711, so the variables need to be scaled if we wish to handle them simultaneously in the analysis.
Below these, the correlations between the variables are examined further with a correlation matrix and a visualization of it. The visualized correlation plot shows nicely which variables are correlated and which are not.
# Access to dplyr and corrplot
library(dplyr)
library(corrplot)
# Scatterplot matrix of the variables
pairs(Boston)
# Summaries of the variables
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
# For the correlations, a "nicer" correlation matrix is made with cor()
cor_matrix<-cor(Boston)
# Print the correlation matrix (values rounded, first 2 digits)
cor_matrix%>%round(2)
## crim zn indus chas nox rm age dis rad tax
## crim 1.00 -0.20 0.41 -0.06 0.42 -0.22 0.35 -0.38 0.63 0.58
## zn -0.20 1.00 -0.53 -0.04 -0.52 0.31 -0.57 0.66 -0.31 -0.31
## indus 0.41 -0.53 1.00 0.06 0.76 -0.39 0.64 -0.71 0.60 0.72
## chas -0.06 -0.04 0.06 1.00 0.09 0.09 0.09 -0.10 -0.01 -0.04
## nox 0.42 -0.52 0.76 0.09 1.00 -0.30 0.73 -0.77 0.61 0.67
## rm -0.22 0.31 -0.39 0.09 -0.30 1.00 -0.24 0.21 -0.21 -0.29
## age 0.35 -0.57 0.64 0.09 0.73 -0.24 1.00 -0.75 0.46 0.51
## dis -0.38 0.66 -0.71 -0.10 -0.77 0.21 -0.75 1.00 -0.49 -0.53
## rad 0.63 -0.31 0.60 -0.01 0.61 -0.21 0.46 -0.49 1.00 0.91
## tax 0.58 -0.31 0.72 -0.04 0.67 -0.29 0.51 -0.53 0.91 1.00
## ptratio 0.29 -0.39 0.38 -0.12 0.19 -0.36 0.26 -0.23 0.46 0.46
## black -0.39 0.18 -0.36 0.05 -0.38 0.13 -0.27 0.29 -0.44 -0.44
## lstat 0.46 -0.41 0.60 -0.05 0.59 -0.61 0.60 -0.50 0.49 0.54
## medv -0.39 0.36 -0.48 0.18 -0.43 0.70 -0.38 0.25 -0.38 -0.47
## ptratio black lstat medv
## crim 0.29 -0.39 0.46 -0.39
## zn -0.39 0.18 -0.41 0.36
## indus 0.38 -0.36 0.60 -0.48
## chas -0.12 0.05 -0.05 0.18
## nox 0.19 -0.38 0.59 -0.43
## rm -0.36 0.13 -0.61 0.70
## age 0.26 -0.27 0.60 -0.38
## dis -0.23 0.29 -0.50 0.25
## rad 0.46 -0.44 0.49 -0.38
## tax 0.46 -0.44 0.54 -0.47
## ptratio 1.00 -0.18 0.37 -0.51
## black -0.18 1.00 -0.37 0.33
## lstat 0.37 -0.37 1.00 -0.74
## medv -0.51 0.33 -0.74 1.00
# Visualize the correlation matrix
corrplot(cor_matrix, method="circle", type="upper", cl.pos="b", tl.pos="d", tl.cex=0.6)
Next the data is standardized. The variables changed so that the values now vary from -3.9 to 9.9. In the standardization, the column mean is subtracted from each column and the resulting difference is divided by the column’s standard deviation.
# Center and standardize variables
boston_scaled<-scale(Boston)
# Summaries of the scaled variables
summary(boston_scaled)
## crim zn indus
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668
## Median :-0.390280 Median :-0.48724 Median :-0.2109
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202
## chas nox rm age
## Min. :-0.2723 Min. :-1.4644 Min. :-3.8764 Min. :-2.3331
## 1st Qu.:-0.2723 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366
## Median :-0.2723 Median :-0.1441 Median :-0.1084 Median : 0.3171
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.2723 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059
## Max. : 3.6648 Max. : 2.7296 Max. : 3.5515 Max. : 1.1164
## dis rad tax ptratio
## Min. :-1.2658 Min. :-0.9819 Min. :-1.3127 Min. :-2.7047
## 1st Qu.:-0.8049 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876
## Median :-0.2790 Median :-0.5225 Median :-0.4642 Median : 0.2746
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6617 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058
## Max. : 3.9566 Max. : 1.6596 Max. : 1.7964 Max. : 1.6372
## black lstat medv
## Min. :-3.9033 Min. :-1.5296 Min. :-1.9063
## 1st Qu.: 0.2049 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median : 0.3808 Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4332 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 0.4406 Max. : 3.5453 Max. : 2.9865
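The scaling formula described above can be verified by hand on a small made-up vector:

```r
# scale() subtracts the mean and divides by the standard deviation;
# computing the same quantity manually gives an identical result
x <- c(1, 2, 3, 10)
manual <- (x - mean(x)) / sd(x)
all.equal(as.vector(scale(x)), manual)  # TRUE
```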
# boston_scaled is changed to data frame format
boston_scaled<-as.data.frame(boston_scaled)
# create a categorical variable of the crime rate
# Create scaled_crim
scaled_crim<-boston_scaled$crim
# create a quantile vector of crim
bins<-quantile(scaled_crim)
# create the categorical variable ´crime´
crime<-cut(scaled_crim, breaks=bins, include.lowest=TRUE, label=c("low", "med_low", "med_high", "high"))
# remove the original crim from the dataset
boston_scaled<-dplyr::select(boston_scaled, -crim)
# add the new categorical value to scaled data
boston_scaled<-data.frame(boston_scaled, crime)
# splitting the Boston data set to test and train sets
# number of rows in data set
n<-nrow(boston_scaled)
# choose randomly 80% of the rows
ind<-sample(n, size=n*0.8)
# create train and test sets
train<-boston_scaled[ind,]
test<-boston_scaled[-ind,]
The linear discriminant analysis is then fitted to the train set. The categorical crime rate is the target variable and all the other variables in the data set are predictor variables.
# LDA on the train set, categorical crime rate as target variable, all other variables as predictor variables.
lda.fit<-lda(crime~., data=train)
# target class as numeric
classes<-as.numeric(train$crime)
# draw the LDA (bi)plot
plot(lda.fit, dimen=2, col=classes, pch=classes)
Next the crime categories are saved from the test data, and the categorical crime variable is removed from the test data set. After this the classes are predicted with the LDA model, and the results are cross tabulated against the correct crime categories from the test set. From the cross tabulation it seems that the high class is predicted most accurately, followed by low; the medium classes are not predicted as well.
# save the crime categories from the test data
correct_classes<-test$crime
# remove the categorical crime from test data
test<-dplyr::select(test, -crime)
# predict classes with test data
lda.pred<-predict(lda.fit, newdata=test)
# cross tabulate the results
table(correct=correct_classes, predicted=lda.pred$class)
## predicted
## correct low med_low med_high high
## low 13 11 1 0
## med_low 6 20 2 0
## med_high 1 9 13 0
## high 0 0 0 26
Lastly, the distances between the observations are calculated and the k-means algorithm is run on the data set.
# reload Boston data set
data(Boston)
# center and standardize variables
boston_scaled<-scale(Boston)
# calculate the distances between observations
dist_eu<-dist(Boston)
# k-means clustering
km<-kmeans(dist_eu, centers=15)
# plot the dataset with clusters
pairs(Boston, col=km$cluster)
Based on the plot, the optimal number of clusters is 2, because that is where the total WCSS drops radically in the figure, and the algorithm is rerun with this value. The clusters are then visualized with a plot where they are separated by colour, and the results are then interpreted.
# set.seed to avoid producing different results every time
set.seed(123)
# calculate the distances between observations
dist_eu<-dist(Boston)
# determine the number of clusters
k_max<-10
# calculate the total WCSS
twcss<-sapply(1:k_max, function(k){kmeans(dist_eu, k)$tot.withinss})
# visualize the results
plot(1:k_max, twcss, type='b')
# k-means clustering
km<-kmeans(dist_eu, centers=2)
# plot the dataset with clusters
pairs(Boston, col=km$cluster)
This week we have used the data from United Nations Development Programme, Human Development Reports (hdr.undp.org).
From the reports, the two data sets of Human Development Index (HDI) and Gender Inequality Index were downloaded, combined, and some data variables were selected for this week’s analysis exercise.
# Load the human.txt data into R.
human<-read.table("~/Desktop/IODS-project/data/human.txt", sep="", header=T)
# Explore the structure and the dimensions of the data
str(human)
## 'data.frame': 155 obs. of 8 variables:
## $ Edu2.FM : num 1.007 0.997 0.983 0.989 0.969 ...
## $ Labo.FM : num 0.891 0.819 0.825 0.884 0.829 ...
## $ Edu.Exp : num 17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
## $ Life.Exp : num 81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
## $ GNI : int 64992 42261 56431 44025 45435 43919 39568 52947 42155 32689 ...
## $ Mat.Mor : int 4 6 6 5 6 7 9 28 11 8 ...
## $ Ado.Birth: num 7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
## $ Parli.F : num 39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
dim(human)
## [1] 155 8
In our data set we have 155 observations and 8 variables, so we have 155 rows (different countries) and 8 columns. All the variables are recognized as numerical; most have decimal values, but GNI and Mat.Mor are integers. The 8 variables chosen from the larger data set (column heading abbreviations) describe the following:
Edu2.FM - ratio of females to males with at least secondary education
Labo.FM - ratio of female to male labour force participation
Edu.Exp - expected years of schooling
Life.Exp - life expectancy at birth
GNI - gross national income per capita
Mat.Mor - maternal mortality ratio
Ado.Birth - adolescent birth rate
Parli.F - percentage of female representatives in parliament
# Show a graphical overview of the data
# Access to ggplot2 and tidyr
library(ggplot2)
library(tidyr)
# For an easier overview of the data, the data set is scaled and its class changed from matrix to data frame.
human_scaled<-scale(human)
human_scaled<-as.data.frame(human_scaled)
# Next scaled data is gathered to key-value pairs and plotted for visualization of the data.
gather(human_scaled)%>%ggplot(aes(value))+facet_wrap("key", scales="free_y")+geom_histogram(bins=155, col="darkblue")+xlim(-6,6)
# To see how scaling affected the data, another set of histograms is made for the original data.
gather(human)%>%ggplot(aes(value))+facet_wrap("key", scales="free")+geom_histogram(bins=155, col="darkred")
From the blue histograms (scaled values) it is perhaps easier to see which variables are normally distributed and which are not: Ado.Birth, GNI and Mat.Mor do not follow a normal distribution. From the red histograms (unscaled values) it is obvious that the data set has no negative values, which makes sense, because none of the variables can take a negative value.
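As a reminder of what `scale()` is doing column-wise, here is a minimal sketch on a toy vector (not the course data): each value is centred by the mean and divided by the standard deviation.

```r
# scale() standardizes a column as (x - mean(x)) / sd(x)
x <- c(2, 4, 6, 8)
manual <- (x - mean(x)) / sd(x)
all.equal(as.vector(scale(x)), manual)  # TRUE
```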
The summary tables below are first for the unscaled data and then for the scaled data. These two tables (and the histograms above) show that GNI has the largest spread and Life.Exp the smallest.
# Creating summary tables for both scaled and original human data
summary(human)
## Edu2.FM Labo.FM Edu.Exp Life.Exp
## Min. :0.1717 Min. :0.1857 Min. : 5.40 Min. :49.00
## 1st Qu.:0.7264 1st Qu.:0.5984 1st Qu.:11.25 1st Qu.:66.30
## Median :0.9375 Median :0.7535 Median :13.50 Median :74.20
## Mean :0.8529 Mean :0.7074 Mean :13.18 Mean :71.65
## 3rd Qu.:0.9968 3rd Qu.:0.8535 3rd Qu.:15.20 3rd Qu.:77.25
## Max. :1.4967 Max. :1.0380 Max. :20.20 Max. :83.50
## GNI Mat.Mor Ado.Birth Parli.F
## Min. : 581 Min. : 1.0 Min. : 0.60 Min. : 0.00
## 1st Qu.: 4198 1st Qu.: 11.5 1st Qu.: 12.65 1st Qu.:12.40
## Median : 12040 Median : 49.0 Median : 33.60 Median :19.30
## Mean : 17628 Mean : 149.1 Mean : 47.16 Mean :20.91
## 3rd Qu.: 24512 3rd Qu.: 190.0 3rd Qu.: 71.95 3rd Qu.:27.95
## Max. :123124 Max. :1100.0 Max. :204.80 Max. :57.50
summary(human_scaled)
## Edu2.FM Labo.FM Edu.Exp Life.Exp
## Min. :-2.8189 Min. :-2.6247 Min. :-2.7378 Min. :-2.7188
## 1st Qu.:-0.5233 1st Qu.:-0.5484 1st Qu.:-0.6782 1st Qu.:-0.6425
## Median : 0.3503 Median : 0.2316 Median : 0.1140 Median : 0.3056
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5958 3rd Qu.: 0.7350 3rd Qu.: 0.7126 3rd Qu.: 0.6717
## Max. : 2.6646 Max. : 1.6632 Max. : 2.4730 Max. : 1.4218
## GNI Mat.Mor Ado.Birth Parli.F
## Min. :-0.9193 Min. :-0.6992 Min. :-1.1325 Min. :-1.8203
## 1st Qu.:-0.7243 1st Qu.:-0.6496 1st Qu.:-0.8394 1st Qu.:-0.7409
## Median :-0.3013 Median :-0.4726 Median :-0.3298 Median :-0.1403
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.3712 3rd Qu.: 0.1932 3rd Qu.: 0.6030 3rd Qu.: 0.6127
## Max. : 5.6890 Max. : 4.4899 Max. : 3.8344 Max. : 3.1850
The relationships between variables can be visualized with a correlation plot. In the first plot below, the distributions of the variables are again visible on the diagonal (from top left to bottom right). The highest correlation is between Life.Exp and Edu.Exp, and it is positive: in countries where the expected years of schooling are high, life expectancy is also high. Mat.Mor and Life.Exp are negatively correlated, so where life expectancy is high, the maternal mortality ratio is low. The least correlated pair is Edu2.FM and Labo.FM: the female-to-male ratio of secondary education and the female-to-male ratio of labour force participation show practically no correlation. The same observations can be made from the second correlation plot below.
# Access GGally and corrplot
library(GGally)
library(corrplot)
# Visualize data variables with ggpairs()
ggpairs(human)
# Compute correlation matrix and visualize it with corrplot()
cor(human)%>%corrplot(order ="hclust", method="color", type="lower")
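The strongest and weakest pairs can also be picked out programmatically from the correlation matrix instead of reading them off the plot; this is a small sketch assuming the `human` data frame loaded earlier.

```r
# Build a long table of unique variable pairs and their correlations
cm <- cor(human)
cm[upper.tri(cm, diag = TRUE)] <- NA        # keep each pair only once
pairs_df <- na.omit(as.data.frame(as.table(cm)))
pairs_df[which.max(abs(pairs_df$Freq)), ]   # strongest correlation (by magnitude)
pairs_df[which.min(abs(pairs_df$Freq)), ]   # weakest correlation
```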
The unstandardized human data is then analyzed with PCA (principal component analysis). Almost all of the variability in this data is captured by the first component (99.99%), and the second component (orthogonal to PC1) captures the remaining 0.01%; the other principal components have no practical importance in terms of captured variance. The arrows in the biplot, however, all point in nearly the same direction: each forms a small angle with the PC1 axis, suggesting a high correlation between every feature and PC1, so the plot tells us little about the correlations between the features themselves. The length of an arrow is proportional to the standard deviation of the feature.
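The dominance of PC1 comes straight from the raw variances: GNI is measured on a scale thousands of times larger than the other variables, so without standardization PC1 is essentially just GNI. This can be checked directly (assuming the `human` data frame from the earlier chunk):

```r
# Column variances of the unscaled data; GNI dwarfs all the others
sort(sapply(human, var), decreasing = TRUE)
```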
# Perform principal component analysis on the unstandardized human data (with the SVD method)
pca_human<-prcomp(human)
# Create and print summary of pca_human
s<-summary(pca_human)
s
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912
## Proportion of Variance 9.999e-01 0.0001 0.00 0.00 0.000 0.000 0.0000
## Cumulative Proportion 9.999e-01 1.0000 1.00 1.00 1.000 1.000 1.0000
## PC8
## Standard deviation 0.1591
## Proportion of Variance 0.0000
## Cumulative Proportion 1.0000
# Rounded percentages of variance captured by each PC, print them out
pca_pr<-round(100*s$importance[2,], digits=2)
pca_pr
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## 99.99 0.01 0.00 0.00 0.00 0.00 0.00 0.00
# Create object pc_lab to be used as axis labels
pc_lab<-paste0(names(pca_pr), "(", pca_pr, "%)")
# Draw a biplot (and suppress warnings for zero-length arrows)
suppressWarnings(biplot(pca_human, cex=c(0.8,1), col=c("grey40", "deeppink2"), xlab=pc_lab[1], ylab=pc_lab[2], main="Biplot using unstandardized human data"))
Next, PCA is done for the standardized variables, and now we get several meaningful principal components. Comparing the unstandardized and standardized PCAs, it is clear that PCA is sensitive to the scaling of the original values: without standardization it treats features with larger variance as more important. In the standardized biplot, Labo.FM and Parli.F form their own group, as do Mat.Mor and Ado.Birth, and the remaining four variables are grouped on the left side. This means that the maternal mortality ratio and the adolescent birth rate are related to similar underlying features, and they are roughly opposite to high life expectancy at birth, expected years of schooling, gross national income per capita, and the female-to-male ratio of secondary education. In terms of countries, those further to the left have higher education (also for girls), better population health, and more money to spend. On the right are countries with a higher maternal mortality ratio and where underage girls become mothers. At the top are countries with a larger proportion of females in working life or in parliament, while in the lower part of the biplot there are fewer females than males in parliament and in the labour force.
# Perform principal component analysis on the standardized human data (with the SVD method)
pca_human_scaled<-prcomp(human_scaled)
# Create and print summary of pca_human_scaled
s_s<-summary(pca_human_scaled)
s_s
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 2.0708 1.1397 0.87505 0.77886 0.66196 0.53631
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595
## Cumulative Proportion 0.5361 0.6984 0.79413 0.86996 0.92473 0.96069
## PC7 PC8
## Standard deviation 0.45900 0.32224
## Proportion of Variance 0.02634 0.01298
## Cumulative Proportion 0.98702 1.00000
# Rounded percentages of variance captured by each PC, print them out
pca_pr2<-round(100*s_s$importance[2,], digits=1)
pca_pr2
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## 53.6 16.2 9.6 7.6 5.5 3.6 2.6 1.3
# Create object pc_lab to be used as axis labels
pc_lab<-paste0(names(pca_pr2), "(", pca_pr2, "%)")
# Draw a biplot for the standardized data
biplot(pca_human_scaled, cex=c(0.8,1), col=c("grey40", "deeppink2"), xlab=pc_lab[1], ylab=pc_lab[2], main="Biplot using standardized human data")
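The grouping read off the biplot can also be checked numerically from the loadings (the rotation matrix returned by `prcomp`); this assumes `pca_human_scaled` from the chunk above.

```r
# Loadings of the first two PCs: variables with large same-signed PC1
# loadings form the left/right split, while Labo.FM and Parli.F load on PC2
round(pca_human_scaled$rotation[, 1:2], 2)
```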
The biplot drawn after PCA on the standardized human data suggests that the first principal component (PC1) describes how developed a country is (from a health, economic and educational point of view), while PC2 mainly describes the role of women in the country (whether women stay home or work, and whether they have a voice in the political system). Because the angle between the Parli.F and Labo.FM arrows and the PC1 axis is large, these two variables are not correlated with how economically developed a country is. The arrows can be divided into three groups: within a group the angles between arrows are small, so the correlation is high, while between groups the angles are large and the correlation is low. Of the within-group pairs, Parli.F and Labo.FM have the largest angle between their arrows, so their relationship is less correlated than the pairs inside the other two groups. The arrow lengths are quite similar, with Edu2.FM and GNI perhaps a bit shorter; these two also have the smallest angle, i.e. the highest correlation, between them. The earlier correlation plot suggested that the highest correlation was between Edu.Exp and Life.Exp, which is not exactly what the biplot shows, but close. Mat.Mor and Life.Exp appear negatively correlated, as in the correlation plot, and Edu2.FM and Labo.FM have an angle close to 90 degrees between their arrows, which means they are the least correlated pair of features in this analysis.
The tea data set is loaded from the FactoMineR package. It contains a questionnaire on tea: 300 individuals were asked how they drink their tea (18 questions), about their perception of the product (12 questions), and about personal details (4 questions). So there are 36 columns and 300 rows; the rows are the individuals who answered the questionnaire and the columns are the questions. The first 18 questions are active variables, the age question is a supplementary quantitative variable, and the remaining questions are supplementary categorical variables.
# Access FactoMineR (includes the tea data) and dplyr (suppress startup messages)
suppressPackageStartupMessages({
library(dplyr)
library(FactoMineR)
})
# List the data sets available in FactoMineR
data(package="FactoMineR")
# Load tea dataset from the package
data(tea)
# Summary and structure (questions) of the tea dataset
summary(tea)
## breakfast tea.time evening lunch
## breakfast :144 Not.tea time:131 evening :103 lunch : 44
## Not.breakfast:156 tea time :169 Not.evening:197 Not.lunch:256
##
##
##
##
##
## dinner always home work
## dinner : 21 always :103 home :291 Not.work:213
## Not.dinner:279 Not.always:197 Not.home: 9 work : 87
##
##
##
##
##
## tearoom friends resto pub
## Not.tearoom:242 friends :196 Not.resto:221 Not.pub:237
## tearoom : 58 Not.friends:104 resto : 79 pub : 63
##
##
##
##
##
## Tea How sugar how
## black : 74 alone:195 No.sugar:155 tea bag :170
## Earl Grey:193 lemon: 33 sugar :145 tea bag+unpackaged: 94
## green : 33 milk : 63 unpackaged : 36
## other: 9
##
##
##
## where price age sex
## chain store :192 p_branded : 95 Min. :15.00 F:178
## chain store+tea shop: 78 p_cheap : 7 1st Qu.:23.00 M:122
## tea shop : 30 p_private label: 21 Median :32.00
## p_unknown : 12 Mean :37.05
## p_upscale : 53 3rd Qu.:48.00
## p_variable :112 Max. :90.00
##
## SPC Sport age_Q frequency
## employee :59 Not.sportsman:121 15-24:92 1/day : 95
## middle :40 sportsman :179 25-34:69 1 to 2/week: 44
## non-worker :64 35-44:40 +2/day :127
## other worker:20 45-59:61 3 to 6/week: 34
## senior :35 +60 :38
## student :70
## workman :12
## escape.exoticism spirituality healthy
## escape-exoticism :142 Not.spirituality:206 healthy :210
## Not.escape-exoticism:158 spirituality : 94 Not.healthy: 90
##
##
##
##
##
## diuretic friendliness iron.absorption
## diuretic :174 friendliness :242 iron absorption : 31
## Not.diuretic:126 Not.friendliness: 58 Not.iron absorption:269
##
##
##
##
##
## feminine sophisticated slimming
## feminine :129 Not.sophisticated: 85 No.slimming:255
## Not.feminine:171 sophisticated :215 slimming : 45
##
##
##
##
##
## exciting relaxing effect.on.health
## exciting :116 No.relaxing:113 effect on health : 66
## No.exciting:184 relaxing :187 No.effect on health:234
##
##
##
##
##
str(tea)
## 'data.frame': 300 obs. of 36 variables:
## $ breakfast : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
## $ tea.time : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
## $ evening : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
## $ lunch : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
## $ dinner : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
## $ always : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
## $ home : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
## $ work : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
## $ tearoom : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ friends : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
## $ resto : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
## $ pub : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ sugar : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ where : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ price : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
## $ age : int 39 45 47 23 48 21 37 36 40 37 ...
## $ sex : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
## $ SPC : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
## $ Sport : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
## $ age_Q : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
## $ frequency : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
## $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
## $ spirituality : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
## $ healthy : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
## $ diuretic : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
## $ friendliness : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
## $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ feminine : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
## $ sophisticated : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
## $ slimming : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ exciting : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
## $ relaxing : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
## $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
Above is the structure of the tea data. It has 300 observations and 36 variables, as it should. The questions are the columns of the data set, and they are factors with 2 or more levels, meaning that the individuals had to choose an answer from 2 or more given options. The only exception is age, where the answer is an integer.
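The claim that age is the only non-factor column can be verified quickly; this assumes `tea` is loaded as above.

```r
# Count the column classes: 35 factor columns and 1 integer column (age)
table(sapply(tea, class))
```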
Let us first look at the 18 questions about how individuals have their tea: the first group of bar plots covers these questions. The next six plots cover the personal questions, including age group and frequency of drinking tea, and the last group of bar plots covers the product perception questions.
# Dividing data for groups for analysis (tea_how, personal information, perception)
keep_columns1<-c("breakfast", "tea.time", "evening", "lunch","dinner","always","home","work","tearoom","friends","resto","pub","Tea","How","sugar","how","where","price")
keep_columns2<-c("age","sex", "SPC","Sport","age_Q","frequency")
keep_columns3<-c("escape.exoticism", "spirituality", "healthy", "diuretic", "friendliness", "iron.absorption", "feminine", "sophisticated", "slimming", "exciting", "relaxing", "effect.on.health")
tea_how<-dplyr::select(tea, one_of(keep_columns1))
tea_inf<-dplyr::select(tea, one_of(keep_columns2))
tea_per<-dplyr::select(tea, one_of(keep_columns3))
# visualize the how dataset (and suppress some warnings)
suppressWarnings(gather(tea_how)%>%ggplot(aes(value))+facet_wrap("key", nrow=3, scales="free")+geom_bar(fill="darkblue")+theme(axis.text.x=element_text(angle=45, hjust=1, size=7), axis.text.y=element_text(hjust=1, size=6), axis.title.x = element_blank(), axis.title.y=element_blank()))
suppressWarnings(gather(tea_inf)%>%ggplot(aes(value))+facet_wrap("key", scales="free")+geom_bar(fill="darkred")+theme(axis.text.x=element_text(angle=90, hjust=1, size=7), axis.text.y=element_text(hjust=1, size=6), axis.title.x = element_blank(), axis.title.y=element_blank()))
suppressWarnings(gather(tea_per)%>%ggplot(aes(value))+facet_wrap("key", scales="free")+geom_bar(fill="darkgreen")+theme(axis.text.x=element_text(angle=45, hjust=1, size=7), axis.text.y=element_text(hjust=1, size=6), axis.title.x = element_blank(), axis.title.y=element_blank()))
Then Multiple Correspondence Analysis (MCA) is done for the whole tea data set. First there is the summary of the MCA, and then two separate variable plots.
# MCA on the whole tea data: variable 19 (age) is quantitative supplementary, variables 20-36 are categorical supplementary
mca_tea<-MCA(tea, quanti.sup=19, quali.sup=20:36, graph=F)
# Summary for the models
summary(mca_tea)
##
## Call:
## MCA(X = tea, quanti.sup = 19, quali.sup = 20:36, graph = F)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6
## Variance 0.148 0.122 0.090 0.078 0.074 0.071
## % of var. 9.885 8.103 6.001 5.204 4.917 4.759
## Cumulative % of var. 9.885 17.988 23.989 29.192 34.109 38.868
## Dim.7 Dim.8 Dim.9 Dim.10 Dim.11 Dim.12
## Variance 0.068 0.065 0.062 0.059 0.057 0.054
## % of var. 4.522 4.355 4.123 3.902 3.805 3.628
## Cumulative % of var. 43.390 47.745 51.867 55.769 59.574 63.202
## Dim.13 Dim.14 Dim.15 Dim.16 Dim.17 Dim.18
## Variance 0.052 0.049 0.048 0.047 0.046 0.040
## % of var. 3.462 3.250 3.221 3.127 3.037 2.683
## Cumulative % of var. 66.664 69.914 73.135 76.262 79.298 81.982
## Dim.19 Dim.20 Dim.21 Dim.22 Dim.23 Dim.24
## Variance 0.038 0.037 0.036 0.035 0.031 0.029
## % of var. 2.541 2.438 2.378 2.323 2.055 1.915
## Cumulative % of var. 84.523 86.961 89.339 91.662 93.717 95.633
## Dim.25 Dim.26 Dim.27
## Variance 0.027 0.021 0.017
## % of var. 1.821 1.407 1.139
## Cumulative % of var. 97.454 98.861 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3
## 1 | -0.541 0.658 0.143 | -0.149 0.061 0.011 | -0.306
## 2 | -0.361 0.293 0.133 | -0.078 0.017 0.006 | -0.633
## 3 | 0.073 0.012 0.003 | -0.169 0.079 0.018 | 0.246
## 4 | -0.572 0.735 0.235 | 0.018 0.001 0.000 | 0.203
## 5 | -0.253 0.144 0.079 | -0.118 0.038 0.017 | 0.006
## 6 | -0.684 1.053 0.231 | 0.032 0.003 0.001 | -0.018
## 7 | -0.111 0.027 0.022 | -0.182 0.090 0.059 | -0.207
## 8 | -0.210 0.099 0.043 | -0.068 0.013 0.004 | -0.421
## 9 | 0.118 0.031 0.012 | 0.229 0.144 0.044 | -0.538
## 10 | 0.258 0.150 0.045 | 0.478 0.627 0.156 | -0.482
## ctr cos2
## 1 0.347 0.046 |
## 2 1.483 0.409 |
## 3 0.224 0.038 |
## 4 0.153 0.030 |
## 5 0.000 0.000 |
## 6 0.001 0.000 |
## 7 0.159 0.077 |
## 8 0.655 0.174 |
## 9 1.070 0.244 |
## 10 0.861 0.158 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr cos2 v.test
## breakfast | 0.166 0.495 0.025 2.756 | -0.166 0.607 0.026 -2.764
## Not.breakfast | -0.153 0.457 0.025 -2.756 | 0.154 0.560 0.026 2.764
## Not.tea time | -0.498 4.053 0.192 -7.578 | 0.093 0.174 0.007 1.423
## tea time | 0.386 3.142 0.192 7.578 | -0.072 0.135 0.007 -1.423
## evening | 0.319 1.307 0.053 3.985 | -0.058 0.053 0.002 -0.728
## Not.evening | -0.167 0.683 0.053 -3.985 | 0.030 0.028 0.002 0.728
## lunch | 0.659 2.385 0.075 4.722 | -0.390 1.018 0.026 -2.793
## Not.lunch | -0.113 0.410 0.075 -4.722 | 0.067 0.175 0.026 2.793
## dinner | -0.661 1.146 0.033 -3.136 | 0.796 2.025 0.048 3.774
## Not.dinner | 0.050 0.086 0.033 3.136 | -0.060 0.152 0.048 -3.774
## Dim.3 ctr cos2 v.test
## breakfast | -0.483 6.900 0.215 -8.017 |
## Not.breakfast | 0.445 6.369 0.215 8.017 |
## Not.tea time | 0.265 1.886 0.054 4.027 |
## tea time | -0.205 1.462 0.054 -4.027 |
## evening | 0.451 4.312 0.106 5.640 |
## Not.evening | -0.236 2.254 0.106 -5.640 |
## lunch | 0.301 0.822 0.016 2.160 |
## Not.lunch | -0.052 0.141 0.016 -2.160 |
## dinner | 0.535 1.235 0.022 2.537 |
## Not.dinner | -0.040 0.093 0.022 -2.537 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## breakfast | 0.025 0.026 0.215 |
## tea.time | 0.192 0.007 0.054 |
## evening | 0.053 0.002 0.106 |
## lunch | 0.075 0.026 0.016 |
## dinner | 0.033 0.048 0.022 |
## always | 0.045 0.001 0.101 |
## home | 0.005 0.000 0.134 |
## work | 0.112 0.043 0.005 |
## tearoom | 0.372 0.022 0.008 |
## friends | 0.243 0.015 0.103 |
##
## Supplementary categories (the 10 first)
## Dim.1 cos2 v.test Dim.2 cos2 v.test Dim.3
## F | 0.151 0.033 3.158 | -0.109 0.017 -2.278 | -0.048
## M | -0.221 0.033 -3.158 | 0.159 0.017 2.278 | 0.070
## employee | -0.153 0.006 -1.313 | -0.151 0.006 -1.289 | 0.103
## middle | -0.030 0.000 -0.205 | 0.336 0.017 2.281 | -0.284
## non-worker | -0.036 0.000 -0.324 | 0.185 0.009 1.666 | -0.291
## other worker | 0.040 0.000 0.187 | 0.013 0.000 0.061 | -0.063
## senior | 0.415 0.023 2.608 | 0.072 0.001 0.452 | -0.187
## student | 0.032 0.000 0.305 | -0.317 0.031 -3.022 | 0.394
## workman | -0.417 0.007 -1.473 | 0.249 0.003 0.878 | 0.343
## Not.sportsman | -0.030 0.001 -0.426 | 0.018 0.000 0.260 | -0.051
## cos2 v.test
## F 0.003 -0.998 |
## M 0.003 0.998 |
## employee 0.003 0.884 |
## middle 0.012 -1.928 |
## non-worker 0.023 -2.620 |
## other worker 0.000 -0.289 |
## senior 0.005 -1.177 |
## student 0.047 3.760 |
## workman 0.005 1.209 |
## Not.sportsman 0.002 -0.721 |
##
## Supplementary categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## sex | 0.033 0.017 0.003 |
## SPC | 0.032 0.053 0.076 |
## Sport | 0.001 0.000 0.002 |
## age_Q | 0.008 0.077 0.146 |
## frequency | 0.094 0.006 0.064 |
## escape.exoticism | 0.000 0.007 0.000 |
## spirituality | 0.005 0.000 0.016 |
## healthy | 0.000 0.000 0.008 |
## diuretic | 0.004 0.000 0.013 |
## friendliness | 0.071 0.001 0.013 |
##
## Supplementary continuous variable
## Dim.1 Dim.2 Dim.3
## age | 0.042 | 0.204 | -0.340 |
# Visualize mca_tea
plot(mca_tea, invisible=c("ind"), cex=0.8, selectMod="contrib 20", unselect="grey30")
# Another visualization of the mca_tea
plot(mca_tea, choix="var", xlim=c(0,0.5), ylim=c(0,0.5), cex=0.7)
From the summary table one can read that dimension 1 explains approximately 10% of the variance and dimension 2 around 8%. The categorical variables table gives the squared correlation (eta2) between each variable and each dimension: a value close to 1 indicates a strong link between the variable and that dimension. Of the ten variables shown, the strongest link is between tearoom and dimension 1. This is best visualized in the second plot, where tearoom stands farthest to the right on the x-axis.
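Instead of scanning the summary table, the eta2 values of the active variables can be sorted directly; a sketch assuming `mca_tea` from the chunk above and the usual FactoMineR result structure (column 1 of `var$eta2` corresponds to Dim 1).

```r
# Active variables ranked by their squared correlation with dimension 1
head(sort(mca_tea$var$eta2[, 1], decreasing = TRUE))
```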
The first MCA plot shows no individuals, only the variable categories; the x-axis is dimension 1 and the y-axis is dimension 2. The 20 categories contributing most are written in red, while the other categories are grey triangles.
The second plot colours the variables by type: active variables in red, supplementary categorical variables in green and the supplementary continuous variable in blue. The coordinate of a categorical variable on a dimension is the squared correlation ratio between the dimension and the variable, whereas the coordinate of a continuous variable is the squared correlation coefficient between the dimension and the variable. The distance between variable categories gives a measure of their similarity; for example, in the last plot resto and friends are closer to each other than Tea and friends.
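The similarity read off the map can also be quantified as Euclidean distances between the category coordinates on the first two dimensions; a sketch assuming `mca_tea` from the earlier chunk and category names as printed by `summary(tea)`.

```r
# Distances between a few category points on dimensions 1-2
coords <- mca_tea$var$coord[c("resto", "friends", "Earl Grey"), 1:2]
round(dist(coords), 3)
```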